Abstract

To analyze whole-genome genetic data inherited in families, the likelihood is typically obtained
from a Hidden Markov Model (HMM) having a state space of $2^n$ hidden states where $n$
is the number of meioses or edges in the pedigree. There have been several attempts to speed 
up this calculation by reducing the state-space of the HMM. One of these methods has been 
automated in a calculation that is more efficient than the naive HMM calculation; however, that 
method treats a special case and the efficiency gain is available for only those rare pedigrees 
containing long chains of single-child lineages. The other existing state-space reduction method 
treats the general case, but the existing algorithm has super-exponential running time. 

We present three formulations of the state-space reduction problem, two dealing with groups 
and one with partitions. One of these problems, the maximum isometry group problem, was
discussed in detail by Browning and Browning [5]. We show that for pedigrees, all three of these 
problems have identical solutions. Furthermore, we are able to prove the uniqueness of the 
solution using the algorithm that we introduce. This algorithm leverages the insight provided 
by the equivalence between the partition and group formulations of the problem to quickly find 
the optimal state-space reduction for general pedigrees. 

We propose a new likelihood calculation which is a two-stage process: find the optimal 
state-space, then run the HMM forward-backward algorithm on the optimal state-space. In 
comparison with the one-stage HMM calculation, this new method more quickly calculates the 
exact pedigree likelihood. 

1 Introduction 

Motivation. Statistical calculations on pedigrees are the principal method behind the most
accurate disease-association approaches [15], [18]. In those approaches, the aim is to find the regions of
the genome that are associated with the presence or absence of a disease among related individuals.
Furthermore, pedigree likelihoods are used to estimate fine-scale recombination rates in humans [1],
where there are few other approaches for making these estimates. Many implementations of the
likelihood estimates for pedigrees exist [7], [11], [16]. Estimates of probabilities on pedigrees are of great
interest to computer scientists because they give an important example of graphical models, which
model probability distributions by using a graph whose edges are conditional probability events
and whose nodes are random variables [12]. Methods for reducing the state-space of a pedigree
graphical model could generalize to other graphical models, as noted also by Geiger et al [8].

The Problem Summary. Hidden Markov Models (HMMs) analyzing the genotypes of related
individuals have running time $O(m 2^{2n})$ where $m$ is the number of sites and $n$ is the number of meioses
in the pedigree. Therefore, it is desirable to find more efficient algorithms. Any partitioning of the
state space into $k$ ensemble states (i.e., states with identical emission probabilities and Markovian
transition probabilities) will improve the running time of an HMM to $O(mk^2)$, even if the ensembles
are not optimal. Since the HMMs have an exponential state space and a running time polynomial
in the size of the state space, even an exponential algorithm for finding ensemble states can improve
the running time of the HMM calculations.

Literature Review. Donnelly [5] introduced the idea of finding ensemble states for the IBD
Markov model, and used a manual method for finding the symmetries for several examples of
two-person pedigrees. Browning and Browning [2] formalized the requirements for symmetries that
describe ensemble states in a new HMM. They gave the first algorithm for finding the maximal
set of isometries that preserves the Markov property and the IBD information. Their algorithm,
which is based on enumerating permutations, appears to have a worst-case running time of at least
$O(n!\,2^{2n})$, where $n$ is the number of meioses in the pedigree. However, the running time of their
algorithm is difficult to analyze. They also left open the question of whether groups other than
isometry groups could give useful state-space reductions [2].

McPeek [14] presented a detailed formulation of the condensed identity states and an algorithm. 
Most recently Geiger et al [8] discussed a similar problem using the language of partitions. In this 
paper, we will prove that the partition problem and the isometry problem are identical. Geiger, 
et al. gave a special-case state-space reduction involving only partitions that collapse simple lin- 
eages (multiple generations with a single child per generation and with the non-lineage parents 
being founders). Several people have introduced algorithms for finding symmetries for systems 
applications [131 HH] • 

Our Contribution. Inspired by the work of Browning and Browning [2], we look for maximal 
ensembles of the hidden states that can be used to create a new HMM with a much more effi- 
cient running-time. We introduce an improved algorithm for finding the maximal ensemble states, 
sets of hidden states, that preserve both the Markov property and the identity by descent (IBD) 
information of the individuals of interest. 

Related work includes the algorithms of Geiger, et al [8], and Browning and Browning [2].
Geiger, et al. found isometries of a limited type in O(n^) time and demonstrated their practical 
value. Browning and Browning introduced a super-exponential algorithm that finds a maximal 
group of isometries, i.e. the largest number of elements. However, they did not draw any conclusions 
about whether their method finds the group with the maximal orbit sizes. 

We introduce an $O(n 2^{2n})$ maximal-ensemble algorithm for finding a permutation group on the
$2^n$ vertices of the hypercube, and for producing the most efficient ensemble states (i.e. the smallest
partition of the state-space that respects the IBD and Markov properties and has the maximal
partition sets and minimal number of sets in the partition). We prove that the optimal partition
is a solution to the maximal isometry group problem that Browning and Browning introduced.
Therefore, both their algorithm and ours find the optimal partition of the state space, which can
be described using a group of isometries having a maximal number of elements. However, our
algorithm is much faster, having a coefficient $n$ instead of $n!$.

We also introduce a bootstrap version of the maximal-ensemble algorithm which takes advantage
of the isometries introduced by Geiger, et al. [8] and the well-known founder isometry. By
enumerating one representative from each set of the partition induced by the known isometries,
we can create a bootstrap maximal-ensemble algorithm that runs in $O(nk2^n)$ time where $n$ is the
number of meioses in the pedigree, and $k$ is the number of sets in the partition induced by the known isometries.

2 Problem Description 

Consider a pedigree graph, P, having individuals V as nodes and having n meioses with each 
meiosis being a directed edge from parent to child. Let $I$ be the set of individuals of interest,
perhaps because we have data for those individuals. While it might be algorithmically convenient 
to assume that I = V, it is impractical. Many of the ancestral individuals in the pedigree are likely 
deceased, and genetic samples are unavailable. 

An inheritance state or vector is a binary vector x with n bits where each bit indicates which 
grand-parental allele, paternal or maternal, was copied for that meiosis. The equivalent inheritance 
graph, $R_x$, has two nodes per individual (one for each allele) and edges from inherited parental alleles
to their corresponding child alleles. Individuals of interest are called identical by descent (IBD) if 
a particular founder allele was copied to each of the individuals. In general, the inheritance graph 
is a collection of trees, since each allele is copied from a single parent. 

The set of all inheritance states (binary $n$-vectors) is the $n$-dimensional hypercube $\mathcal{H}_n$, with $2^n$
vertices. The inheritance process is modeled as a symmetric random walk on $\mathcal{H}_n$, with the time
dimension of the walk being the distance along the genome. At equilibrium, the walk has uniform
probability of being at any of the hypercube vertices. From vertex $x$ in $\mathcal{H}_n$, a step is taken to a
neighboring vertex after an exponential waiting time with parameter $\lambda = n$. For each individual
zygote, with one meiosis, this is a Poisson process with parameter $\lambda = 1$ and genome length roughly
30.

There is a discrete version of this random walk, which is often used for hidden Markov models
(HMMs) that compute the probability of observing the given data by taking an expectation over
the possible random walks on the hypercube. Let $X$ be a Markov process, $\{X_t : t = 1, 2, \ldots, m\}$
for $m$ loci with a state space $\mathcal{H}_n$ consisting of all the inheritance states of the pedigree. The
recombination rate, $\theta_t$, is the probability of recombination per meiosis between a neighboring pair
of loci, $t$ and $t+1$. If $t$ and $t+1$ are separated by distance $d$, then the Poisson process tells us that
the probability of an odd number of recombinations is $\theta_t = \frac{1}{2}(1 - e^{-2d})$. The natural distance
on $\mathcal{H}_n$ is the Hamming distance, $|x \oplus y|$, for two states $x$ and $y$, where $\oplus$ is the XOR operation and
$|\cdot|$ is the $L^1$-norm in $\mathbb{R}^n$. Then the probability of transitioning from $x$ to $y$ is

$$\Pr[X_{t+1} = y \mid X_t = x] = \theta_t^{|x \oplus y|} (1 - \theta_t)^{n - |x \oplus y|}.$$
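
As a concrete illustration of the transition formula, the following Python sketch evaluates the recombination fraction from a genetic distance and the one-step transition probability between two inheritance vectors. The function names and the encoding of inheritance vectors as Python integers are assumptions made for this example only, and the distance $d$ is assumed to be measured in units where the per-meiosis Poisson rate is 1.

```python
import math

def recombination_fraction(d):
    """Probability of an odd number of recombinations over distance d (rate 1)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def hamming(x, y):
    """Hamming distance |x XOR y| between two inheritance vectors stored as ints."""
    return bin(x ^ y).count("1")

def transition_prob(x, y, n, theta):
    """Pr[X_{t+1} = y | X_t = x] for n meioses and recombination fraction theta."""
    k = hamming(x, y)
    return theta ** k * (1.0 - theta) ** (n - k)

n = 4
theta = recombination_fraction(0.1)
# Summing over all 2^n target states should give a proper probability distribution.
row_sum = sum(transition_prob(0b0001, y, n, theta) for y in range(2 ** n))
```

Because each of the $n$ bits flips independently with probability $\theta_t$, every row of the transition matrix sums to one, which is what `row_sum` checks numerically.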

Figure 1 shows an example HMM with three genomic sites. The states of the HMM are shown in
circles on the right.

We define potential ensembles of states as being the orbits of a group of symmetries. Let $G$
be a group that acts on the state space $\mathcal{H}_n$ of $X$. A symmetry is a bijection $\psi \in G$ where $\psi$ is a
permutation on $2^n$ elements, the vertices of $\mathcal{H}_n$. An orbit of $G$ acting on $\mathcal{H}_n$ is the set

$$\omega(y) = \{x \mid x = \psi(y) \text{ and } \psi \in G\},$$

and we write the set of all orbits of $G$ as $\Omega(G) = \{\omega(y) : y \in \mathcal{H}_n\}$.
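
The orbit definition can be made concrete with a short Python sketch that closes each point under a set of generating permutations. This is a generic breadth-first orbit computation, not the algorithm proposed in this paper; the function names and the 2-bit example are hypothetical.

```python
from collections import deque

def orbits(generators, points):
    """Partition `points` into orbits of the group generated by `generators`.

    Each generator is a function that permutes `points`. On a finite set,
    closing under the generators alone already reaches the whole orbit,
    so inverses do not need to be supplied.
    """
    seen = set()
    result = []
    for start in points:
        if start in seen:
            continue
        orbit = {start}
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for g in generators:
                y = g(x)
                if y not in orbit:
                    orbit.add(y)
                    queue.append(y)
        seen |= orbit
        result.append(frozenset(orbit))
    return result

# Hypothetical example on the 2-bit hypercube: one generator that swaps the two bits.
def swap_bits(x):
    return ((x & 0b01) << 1) | ((x & 0b10) >> 1)

two_bit_orbits = orbits([swap_bits], range(4))
```

In this toy case the orbits group the vertices 01 and 10 together and leave 00 and 11 as singletons.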

Figure 1: Two Half-Siblings. (Left Panel) A pedigree with two non-founders of which two are
half-siblings together with their common parent. Circles and boxes represent female and male
individuals, while the two black dots for each person represent their two chromosomes or alleles.
Edges are implicitly directed downward from parent to child. The alleles of each individual are
ordered, so that the left allele, or paternal allele, is inherited from the person's father, while
the right, maternal allele is inherited from the mother. The two siblings are the only labeled
individuals. Their genomes are shown in color so that the same color indicates inheritance from
the same ancestor. For convenience, the genotype of each person is homozygous. (Right Panel)
The HMM for the genotypes from the left panel. At each site in the genome, the possible states
are the vectors in $\mathcal{H}_n$. In each circle an inheritance state is drawn as an inheritance graph and
the inheritance states for a single site are arranged in a column. The allowed transitions between
neighboring sites are a complete bipartite graph (due to space, only a fraction of the edges are
drawn). The nodes with a slash through them are inheritance states that are not allowed by the
data. The red nodes and edges are the path for the actual inheritance states indicated by the yellow
and blue in the left panel. However, this is only one of several paths of inheritance states that are
consistent with the data.

Conventional algorithms for computing likelihoods of data have an exponential running time, 
because the state space of the HMM is exponential in the number of meioses in the pedigree. We 
propose new ways to collapse hypercube vertices into ensemble states for a new HMM that has 
a more efficient running time. In particular we are interested in optimal ensemble states that 
preserve certain relationship structures: the Markovianness of the random walk and the emission 
probabilities. We will first discuss the Markov property and then discuss the constraints on ensemble 
states that the emission probabilities provide. 

2.1 Markov Property 

Let $\{X_t\}$ be a stationary, reversible Markov chain on a finite state space, such as the chain
corresponding to the hidden states of the pedigree HMM.

Let $Y$ be a new process, $\{Y_t : t = 1, 2, \ldots, m\}$, having states $\Omega(G) = \{\omega_1, \ldots, \omega_k\}$, which are
the orbits of some group $G$. This new Markov chain is coupled to the original such that when
$X_t = x \in \omega \in \Omega(G)$, $Y_t = \omega$, and is a projection of $X_t$ into a smaller state space. Define the
transition probabilities for process $Y_t$ as

$$\Pr[X_{t+1} \in \omega_j \mid X_t = x] = \sum_{y \in \omega_j} \Pr[X_{t+1} = y \mid X_t = x] \quad \text{for } x \in \omega_i, \text{ for } \omega_i, \omega_j \in \Omega(G). \qquad (1)$$

We will call $Y_t$ the expectation chain since

$$\Pr[Y_{t+1} = \omega_j \mid Y_t = \omega_i] = E[E_j \mid X_t = x],$$

where $E_j$ is the event that $X_{t+1} \in \omega_j$.

Since $X_t$ is stationary and reversible, the necessary and sufficient condition [3] for $Y_t$ to also be
Markov is that

$$\sum_{y \in \omega_j} \Pr[X_{t+1} = y \mid X_t = x_1] = \sum_{y \in \omega_j} \Pr[X_{t+1} = y \mid X_t = x_2] \qquad (2)$$

for all $x_1, x_2 \in \omega_i$, for all $i$, and for all $\omega_j$. Therefore any group whose orbits satisfy this set of
equations can be used to create a new Markov chain $Y_t$.

From Equations (1) and (2) we see that the stationary distribution of Markov chain $Y_t$ is
$\Pr[Y_t = \omega_i] = \sum_{y \in \omega_i} \pi_y$, where $\pi_y$ is the stationary distribution of $X_t$. For pedigree HMMs, the
stationary distribution of $X_t$ is uniform, $\pi_y = 1/2^n$; therefore the expectation chain for some group
that satisfies Equation (2) will have a stationary distribution $\Pr[Y_t = \omega_i] = |\omega_i|/2^n$.

For pedigree Markov chains, Equation (2) becomes, for $s = \theta/(1 - \theta)$ and $0 < \theta < 0.5$,

$$\sum_{y \in \omega_j} s^{|y \oplus x_1|} = \sum_{y \in \omega_j} s^{|y \oplus x_2|} \quad \forall x_1, x_2 \in \omega_i. \qquad (3)$$

If the expectation chain Yt corresponding to pedigree Markov chain Xt satisfies this equation, we 
say that it satisfies the Markov property. Notice that these polynomials are identical if and only if 
the coefficients of like powers are equal. 
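
Since the two polynomials in $s$ agree exactly when their coefficients agree, the Markov property can be tested by comparing, for every pair of sets, the histogram of Hamming distances from each point of one set to the other set. The sketch below is one way to do this in Python; the function names and the two small 2-bit partitions are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def distance_profile(x, block):
    """Coefficients of sum_{y in block} s^{|y XOR x|}, keyed by the power of s."""
    return Counter(bin(x ^ y).count("1") for y in block)

def satisfies_markov(partition):
    """True if, for every pair of blocks, all x in the first block have the
    same distance profile to the second block (coefficient-wise equality)."""
    for block_i in partition:
        members = list(block_i)
        for block_j in partition:
            reference = distance_profile(members[0], block_j)
            if any(distance_profile(x, block_j) != reference for x in members[1:]):
                return False
    return True

# Two hypothetical partitions of the 2-bit hypercube {00, 01, 10, 11}.
by_weight = [{0b00}, {0b01, 0b10}, {0b11}]
lopsided = [{0b00, 0b01}, {0b10}, {0b11}]
```

Here `by_weight` passes the check, while `lopsided` fails because 00 and 01 are at different distances from 10.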

Browning and Browning [2] gave an algorithm that searches for a maximal group of isometries,
where the group was maximal in the number of group elements. A group, $G$, of isometries has
orbits $\Omega(G) = \{\omega_1, \ldots, \omega_k\}$ such that $|T(x) \oplus T(y)| = |x \oplus y|$ for all $T \in G$, $y \in \omega_j$ and $x \in \omega_i$, for
all $i, j$. This means that the transition probabilities are related by

$$\Pr[X_{t+1} = y \mid X_t = x] = \Pr[X_{t+1} = T(y) \mid X_t = T(x)]. \qquad (4)$$

Browning and Browning left open the question of whether any symmetry group satisfying
Equation (3) must be equivalent to a group of isometries (meaning that it has the same orbits). We
answer this question. Theorem 1 proves that for any permutation group whose orbits satisfy Equation (3), there is
always a group of isometries having the same orbits.

Theorem 1. Let $S$ be a group of permutations of $\mathcal{H}_n$ whose orbits $\Omega(S)$ satisfy Equation (3). Then
there exists a group of isometries $G$ having the same orbits as $S$: that is, for every $T \in G$ and all
$x, y \in \mathcal{H}_n$, $|y \oplus x| = |T(y) \oplus T(x)|$, and the set of orbits $\Omega(G)$ is equal to $\Omega(S)$.

Proof. We prove this by constructing a generating set $A$ for $G$. First, let the identity permutation
$\pi_e$ be in $A$. Then for each orbit $\omega$ of $S$, and each pair of points $x_1$ and $x_2$ in $\omega$, we will construct
a permutation $\pi_{x_1,x_2}$ to add to the generating set $A$. If $x_1 = x_2$, then $\pi_{x_1,x_2} = \pi_e$, which is already
in $A$. If $x_1 \neq x_2$ then $\pi_{x_1,x_2}$ will be a composition of disjoint two-cycles, in particular including
the cycle $(x_1\ x_2)$. Our generating set $A$ will then be the union of all these permutations, so by
construction it will generate a group $G = \langle A \rangle$ having the same orbits as $S$.

For fixed $x_1, x_2 \in \omega$, the two-cycles comprising $\pi_{x_1,x_2}$ are constructed as follows.
For each $k = 1, \ldots, n$, define $a_k := \#\{y \in \omega : |y \oplus x_1| = k\}$ and $b_k := \#\{z \in \omega : |z \oplus x_2| = k\}$,
which implies by Equation (3) that $a_k s^k = b_k s^k$ for each $k$, and hence $a_k = b_k$, since $s > 0$ and
polynomials in $s$ are uniquely determined by their coefficients and powers. Then, for each given
$y_1 \in \omega$ that is distinct from both $x_1$ and $x_2$, there exists $z_1$ such that $|y_1 \oplus x_1| = |z_1 \oplus x_2| = k$,
because $b_k = a_k \geq 1$, a consequence of the fact that $y_1 \in A_k := \{y \in \omega : |y \oplus x_1| = k\}$. In other words,
$z_1 := y_1 \oplus (x_1 \oplus x_2)$, and the cycle is $c_1 := (y_1\ z_1)$.

Proceed similarly for $y_2 \in \mathcal{H}_n \setminus \{x_1, x_2, y_1, z_1\}$, defining $z_2 := y_2 \oplus (x_1 \oplus x_2)$, and the cycle
$c_2 := (y_2\ z_2)$, and so on for each $y_i \in \mathcal{H}_n \setminus \{x_1, x_2, y_1, z_1, \ldots, y_{i-1}, z_{i-1}\}$, with $z_i := y_i \oplus (x_1 \oplus x_2)$
and $c_i := (y_i\ z_i)$. Then we define the permutation $\pi_{x_1,x_2} := c_1 \circ c_2 \circ \ldots \circ c_{2^{n-1}}$. In particular it has
the cycle $(x_1\ x_2)$ in its composition, since when $y = x_1$, we have $z = x_2$. Notice also that the
definitions of $z_i$ imply that

$$y_i \oplus y_j = y_i \oplus x_1 \oplus y_j \oplus x_1 = z_i \oplus x_2 \oplus z_j \oplus x_2 = z_i \oplus z_j, \qquad (5)$$
$$y_i \oplus z_j = y_i \oplus x_1 \oplus z_j \oplus x_1 = z_i \oplus x_2 \oplus y_j \oplus x_2 = z_i \oplus y_j. \qquad (6)$$

Hence by taking $L^1$ norms, the permutation $\pi_{x_1,x_2}$ is an isometry with respect to Hamming distance.

Furthermore, the group $G = \langle A \rangle$ will have the same orbits as $S$, since for each orbit $\omega$ and
each pair $x_1, x_2 \in \omega$, the cycle $(x_1\ x_2)$ will appear in some permutation, and no pair of points from
different orbits will appear as a cycle in any permutation. □

This proof complements the result from Browning and Browning regarding the fact that
isometry groups always satisfy Equation (3). Indeed, we will state the complete result as a corollary.

Corollary 2. A group $S$ has orbits $\Omega(S)$ satisfying Equation (3) if and only if there is an isometry
group $G$ whose orbits $\Omega(G)$ are identical to $\Omega(S)$.

Proof. Browning and Browning [2] showed that all isometry groups $G$ satisfy Equation (3). Theorem 1
completes the proof. □

It is a well-known fact in algebra that any partition can be the orbits of some symmetry group,
and that the orbits of any symmetry group are a partition [6]. We will recapitulate this simple
result next.

Corollary 3. A partition satisfies Equation (2) if and only if it is equivalent to the orbits of some
isometry group.

Proof. Assume we are given a partition $\{W_1, \ldots, W_k\}$ of set $\mathcal{H}_n$ where $W_i \cap W_j = \emptyset$ for $i \neq j$, $\cup_i W_i = \mathcal{H}_n$,
and the partition satisfies Equation (3). We will create a symmetry group $S$ whose orbits are $\Omega(S) =
\{W_1, \ldots, W_k\}$. This is easily done. For each set $W_i$ in the partition, create a permutation with a
single cycle $\pi_i = (y_1\ y_2\ \ldots\ y_l)$ where $y_1, \ldots, y_l$ are all the elements of $W_i$. Make a generating set $A = \{\pi_i : 1 \leq i \leq k\} \cup \{\pi_e\}$
where $\pi_e$ is the identity permutation. Then group $S = \langle A \rangle$ clearly has orbits $\Omega(S) = \{W_1, \ldots, W_k\}$.
By Theorem 1 there is an isometry group with the same orbits.

Assume we are given an isometry group $G$. Clearly, by Browning and Browning's proof [2], the
orbits define a partition that satisfies Equation (3). □

Browning and Browning [2] also showed that any isometry $T : \mathcal{H}_n \to \mathcal{H}_n$ can be uniquely
written as $T = \pi \circ \phi_a$ where $\pi$ is a permutation on $n$ elements, the bits of the hypercube vertex,
and $\phi_a$ is a switch function where $\phi_a(x) = a \oplus x$ where $\oplus$ is the bit-wise XOR operation.

An isometry describes some aspect of the pedigree graph. For example, an isometry consisting
of a switch and the identity permutation can be used to enumerate one element from each orbit by
simply fixing the value of one bit that the switch flips and then enumerating all possible values for the remaining bits.
On the other hand, an isometry consisting of the identity switch (all zero) and a permutation of one
cycle can be used to enumerate one element for each orbit by listing the 1-prefixes of the permuted
bits (i.e. for three bits, the representatives are 000, 100, 110, and 111).
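
To make the decomposition of an isometry into a bit permutation and a switch concrete, the following Python sketch builds such a map, checks numerically that it preserves Hamming distance, and confirms that the 1-prefixes 000, 100, 110, 111 are representatives of distinct orbits of a one-cycle permutation on three bits. The helper names and the convention that `perm[i]` is the destination of bit `i` are assumptions of this sketch.

```python
from itertools import product

def make_isometry(perm, switch):
    """Return T(x) = perm applied to (x XOR switch), acting on tuples of bits.

    `perm[i]` is the position that bit i is moved to; `switch` is a tuple of bits.
    """
    def T(x):
        flipped = [b ^ a for b, a in zip(x, switch)]
        out = [0] * len(x)
        for i, b in enumerate(flipped):
            out[perm[i]] = b
        return tuple(out)
    return T

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

n = 3
cube = list(product((0, 1), repeat=n))

# An isometry combining a switch with a one-cycle permutation (bit 0 -> 1 -> 2 -> 0).
T = make_isometry(perm=(1, 2, 0), switch=(0, 1, 1))
is_isometry = all(hamming(T(x), T(y)) == hamming(x, y) for x in cube for y in cube)

# With the identity switch, the orbits of the 3-cycle are the Hamming-weight
# classes, so each 1-prefix lies in a different orbit.
rotate = make_isometry(perm=(1, 2, 0), switch=(0, 0, 0))

def orbit_of(x):
    seen, y = {x}, rotate(x)
    while y not in seen:
        seen.add(y)
        y = rotate(y)
    return frozenset(seen)

representatives = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
rep_orbits = {orbit_of(r) for r in representatives}
```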

2.2 Emission Property 

The Markov property is not enough to ensure that the HMM based on Markov chain Yt has the 
same likelihood as the HMM for Xt. In order to ensure this, we introduce a property on the emission 
probabilities, namely that all the elements in one orbit must have identical emission probabilities. 
We call these orbits the emission partition, since they are induced by the emission probability. In
order to define this object, we need to introduce some more notation. 

Recall that $R_x$ is the inheritance graph for inheritance path $x$. The relationship structures we
wish to preserve are the IBD relationships on the individuals of interest $I$. Let $I_m$ be the maternal
alleles of all the individuals of interest and $I_p$ be the paternal alleles of all the individuals of interest.
The inheritance graph $R_x$ is a forest of trees; let $CC(R_x)$ refer to the connected components of $R_x$
which are labeled with $I_m \cup I_p$. The same-labeled connected components induce a partition

$$D_x = \{y \in \mathcal{H}_n \mid CC(R_y) = CC(R_x)\}.$$

We call the partition $\{D_x \mid \forall x\}$ the identity states, since it indicates a particular identity-by-descent
(IBD) relationship among the labeled individuals. These have been well studied [9], [17], [11].

Looking at a small example, containing two siblings who are the individuals of interest and 
their two parents, we see that the identity states are: 

$D_{0000} = \{0000, 0101, 1010, 1111\}$,
$D_{1000} = \{1000, 0010, 1101, 0111\}$,
$D_{0100} = \{0100, 1110, 0001, 1011\}$,
$D_{1100} = \{1100, 0110, 1001, 0011\}$.
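
These four sets can be reproduced with a few lines of Python. The sketch assumes a bit order that the text does not state explicitly, namely (child 1 paternal meiosis, child 1 maternal meiosis, child 2 paternal meiosis, child 2 maternal meiosis), and assumes the two parents are unrelated founders; under those assumptions the IBD pattern of the siblings depends only on whether their paternal bits agree and whether their maternal bits agree.

```python
from itertools import product

def identity_key(x):
    """IBD pattern among the four labeled sibling alleles.

    Assumed bit order (not stated in the text): x = (c1_pat, c1_mat, c2_pat, c2_mat),
    where each bit says which of that parent's two alleles the child received.
    The siblings' paternal alleles are IBD iff x[0] == x[2], and their
    maternal alleles are IBD iff x[1] == x[3].
    """
    return (x[0] == x[2], x[1] == x[3])

def identity_states(n=4):
    """Group all inheritance vectors by their IBD pattern."""
    states = {}
    for x in product((0, 1), repeat=n):
        states.setdefault(identity_key(x), set()).add("".join(map(str, x)))
    return states

D = identity_states()
```

With this encoding, `D[(True, True)]` corresponds to $D_{0000}$ and `D[(False, False)]` to $D_{1100}$.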

But if we think carefully about this example, there is symmetry in the pedigree, namely swapping 
the two parents, that does not appear in this partition. Due to this reason, we need to consider 
the following object. 

Let $\Pr[D \mid X_t]$ be the probability that the state $X_t$ of the HMM emits the observed data $D$ at
locus $t$. Then the partition induced on the state space by the emission probability is the emission
partition containing all distinct sets $E_x$:

$$E_x = \{y \in \mathcal{H}_n \mid \Pr[D = d \mid X_t = x] = \Pr[D = d \mid X_t = y]\ \forall d\}$$

and

$$\Pr[D = d \mid X_t = x] = \sum_{\tilde{d} \text{ consistent with } R_x}\ \prod_{c \in CC(R_x)} \Pr[c(\tilde{d})]$$

where $d$ is a vector of sets, $\tilde{d}$ is a vector of tuples that is an ordered version of $d$, meaning that $\tilde{d}_i = d_i$
after removing the order information from $\tilde{d}_i$, $c(\tilde{d})$ gives the allele of $\tilde{d}$ that is assigned to that
connected component, and $h(d)$ is the number of heterozygous sites in $d$. Note that each connected
component is a tree, and has exactly one founder. Also, the identity states are consistent with these
probabilities, but the identity states are a sub-partition of the emission partition. Specifically, from
our previous example, $0100 \notin D_{1000}$, but $0100 \in E_{1000}$. Indeed, the emission partition for the
example is $D_{0000}$, $D_{1000} \cup D_{0100}$, $D_{1100}$.

We say that the expectation Markov chain $Y_t$ satisfies the emission property if and only if it
preserves the emission partition, which is required for the corresponding HMM to have the correct likelihood.
To preserve the emission partition, all the group elements $T \in G$ must satisfy $T(y) \in E_x$ for all
$y \in E_x$ and for all $x$.

Now, it is necessary to compute the $E_x$ quickly. The naive algorithm would be slow, since we
would have to consider all pairs $x, y$ and all possible data $d$. Neither can we use the methods in
the literature dealing with condensed identity states [9], [17], [11], because the literature computes
pedigree-free condensed identity states. That calculation takes the sets from the identity states and
applies permutations of the form $\pi_i = (i_m\ i_f)$ to swap the alleles of an individual of interest $i \in I$.
However, these permutations can violate the inheritance rules specified by a fixed pedigree. For
the example above, take vector $1010 \in D_{0000}$ and swap the alleles of the second child: $\pi_2(1010) =
1001 \in D_{1100}$. This clearly produces a partition that is not the emission partition, and so it would
violate the property that we wish to enforce. Several works on optimal state space reduction for
pedigree HMMs have discussed the condensed identity states [2] [H] for state-space reduction. It
would appear that they did not formulate the emission partition that was mentioned by Geiger, et
al. [8] and that is used here.

The main difference between the $D$ and $E$ partitions is that the probability $\Pr[D = d \mid X_t = x]$ has a
product over indistinguishable connected components, whereas the identity states distinguish each
connected component. The emission partition must answer the question of which connected components
are exchangeable. Let $I'$ be the individuals of interest having parents who are not individuals of
interest. Then we can rewrite $E_x$ as follows:

$$E_x = \{y \in \mathcal{H}_n \mid \exists\, \phi \text{ a proper isomorphism s.t. } CC(R_x) = CC(\phi(R_y))\}$$

where an isomorphism $\phi$ is proper if and only if $\phi$ is an isomorphism from $R_y$ to $R_x$ where for all
$i \in I' \cup V \setminus I$, either $\phi(i_f) = i_f$ and $\phi(i_m) = i_m$, or $\phi(i_f) = i_m$ and $\phi(i_m) = i_f$. This definition of
$E_x$ is easier to compute, because now we can do an $O(n)$ check to see if the forests of trees in $x$ and
$y$ are isomorphic, which leads to an $O(n 2^{2n})$ calculation. However, we can do better.

From the above definition, we see that in order for two inheritance paths to be isomorphic,
the pedigree graph itself (as opposed to the inheritance graph) must have an automorphism. If we
can identify all the relevant automorphisms for the pedigree graph, then we can make a set $A$ of
permutations (one for each automorphism), and use a group theoretic algorithm for obtaining the
orbits of $\langle A \rangle$ acting on the partition $\{D_x \mid \forall x \in \mathcal{H}_n\}$ to obtain the desired emission partition.

First, to obtain the automorphisms of the graph, we will employ a naive strategy. Let $i \in
I' \cup V \setminus I$ be an individual of interest. Recall that any proper isomorphism must map one branch
of $i$'s ancestral lineage to the other branch. In order to be consistent, for the set $J = \{i\} \cup
\{j \mid j \text{ full sib of } i\}$, the automorphism must satisfy $\phi(j_m) = j_f$ for $j \in J$. Considering $i$'s parents and
proceeding backward in time, the sub-pedigree connected to the ancestors forms a directed acyclic
graph (DAG) with in-degree two. Without loss of generality, we can assume that this sub-pedigree
has no individuals in $I \setminus \{i\}$, because if there were, there would be no proper automorphism, and
if there is a descendant of the ancestors not in $I$, it can be trivially removed from the pedigree [14].
Therefore, we may consider only the tree of direct ancestors branching backward in time. At each
branch point, $b$, in this tree, we assign an indicator $\gamma_b = 1$ if the father is to the left and the mother
to the right. There are $O(2^n)$ assignments of these variables $\{\gamma_b \mid \forall b\}$. For each possible assignment,
perform an $O(n)$ graph-traversal operation to check whether this assignment is an automorphism.
We take the first automorphism $\phi$ that we find, because any other $\phi'$ from the same lineage will
satisfy $CC(\phi(R_x)) = CC(\phi'(R_x))$ for all inheritance paths $x$.

Now that we have the automorphisms, we can write them as isometries and put them in set $A$
and consider the orbits of the group $\langle A \rangle$ acting on the identity states. These orbits are the emission
partition. To obtain these orbits, we will use the well-known orbit algorithm from computational
group theory. Notice that we wish to apply this algorithm to the existing partition $M := \{D_x \mid \forall x\}$.
Take one set $D_x \in M$ and initialize its orbit as $O_x := \{D_x\}$. At the end of the following procedure
$O_x$ will contain all the elements in $x$'s orbit. For every element $D_x \in O_x$ and every automorphism
permutation $a \in A$, compute $y := a(z)$ for all $z \in D_x$. If $y \notin O_x$, then this $y$ and all the elements in its
set $D_y$ are added to $O_x$ and $D_y$ is removed from $M$. This procedure is repeated until $M$ is empty.
Notice that $CC(R_y) = CC(R_{a(z)})$ is compared to $CC(R_x)$ to determine if $y$ is in $O_x$.
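
The orbit procedure on the partition $M$ can be sketched in Python as follows. This is a simplified illustration that tests set membership directly rather than comparing connected components, and it is applied to the two-sibling example with a single automorphism, the swap of the two parents. The function names and the bit order (child 1 paternal, child 1 maternal, child 2 paternal, child 2 maternal) are assumptions of this sketch.

```python
def emission_partition(identity_sets, automorphisms):
    """Merge identity-state sets that are mapped onto each other by the
    given automorphism permutations (a sketch of the orbit procedure)."""
    remaining = [frozenset(s) for s in identity_sets]
    merged = []
    while remaining:
        orbit = set(remaining.pop())
        frontier = list(orbit)
        while frontier:
            z = frontier.pop()
            for a in automorphisms:
                y = a(z)
                if y not in orbit:
                    # Pull in the whole identity set D_y that contains y.
                    d_y = next(s for s in remaining if y in s)
                    remaining.remove(d_y)
                    orbit |= d_y
                    frontier.extend(d_y)
        merged.append(frozenset(orbit))
    return merged

# Swapping the parents exchanges each child's paternal and maternal meiosis bits
# (assumed bit order: c1_pat, c1_mat, c2_pat, c2_mat).
def swap_parents(x):
    return x[1] + x[0] + x[3] + x[2]

M = [
    {"0000", "0101", "1010", "1111"},
    {"1000", "0010", "1101", "0111"},
    {"0100", "1110", "0001", "1011"},
    {"1100", "0110", "1001", "0011"},
]
E = emission_partition(M, [swap_parents])
```

For this example the procedure merges the second and third identity states and leaves the other two unchanged, matching the emission partition given earlier for the two siblings.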

Since the comparison $CC(R_y) = CC(R_x)$ can be computed in linear time, the running time to
obtain the automorphisms is $O(n2^n)$ and the orbit algorithm runs in $O(n2^n)$ time. This means
that obtaining automorphisms of the pedigree is preferable to checking pairs of inheritance vectors
for isomorphism.

2.3 Examples 

We will consider two examples here. The first is a three-generation pedigree while the second is a
result that applies to all two-generation pedigrees.

2.3.1 Three-Generation Pedigree 

For example, given 4 meioses for two half-cousins, $A$ and $B$, with one shared grandparent, their
common grandparent and their respective parents who are half-siblings, we have 16 hypercube
vertices (see Figure 2). Our individuals of interest are $I = \{A, B\}$. The emission partition is,
in this case, identical to the identity states and contains the sets $E_1 = \{A_p\}\{A_m, B_m\}\{B_p\}$ and
$E_2 = \{A_p\}\{A_m\}\{B_m\}\{B_p\}$, since these are the only partitions of alleles of individuals $I$ that have
non-empty sets in the emission partition. The emission partition induced on the hypercube vertices
is: $E_{x_1} = \{1001, 1111\}$ and $E_{x_2} = \mathcal{H}_n \setminus E_{x_1}$.

Notice that in this instance we cannot use the emission partition $\{E_x \mid \forall x\}$ as the state space
of a new Markov chain. For example, if we were to let $Z_t$ be a Markov chain on the partition
given by the emission partition, then the Markov criteria would fail to hold. Specifically, consider
states 0001 and 0011, both of which lie in $E_{x_2}$. Then by checking Equation (2), we have

$$\sum_{y \in E_{x_1}} \Pr[X_{t+1} = y \mid X_t = 0001] = \theta(1 - \theta)^3 + \theta^3(1 - \theta) \quad \text{but} \quad \sum_{y \in E_{x_1}} \Pr[X_{t+1} = y \mid X_t = 0011] = 2\theta^2(1 - \theta)^2.$$

The largest partition of $\mathcal{H}_n$ that satisfies the Markov criteria is $P_J = \{1001, 1111\}$, $P_R =
\{0010, 0100\}$, $P_G = \{1011, 1101\}$, $P_B = \{0000, 0110\}$, $P_K = \{0011, 0101, 1010, 1100\}$, and $P_L =
\{0001, 0111, 1000, 1110\}$. Let $H$ be the matrix of pair-wise Hamming distances between all the
vertices of the hypercube. Then, by Equation (1), the transition probabilities take the form

$$\Pr[Y_{t+1} = P_b \mid Y_t = P_a] = \sum_{y \in P_b} \theta^{H_{x,y}} (1 - \theta)^{n - H_{x,y}} \quad \text{for any } x \in P_a.$$

For example, $\Pr[Y_{t+1} = P_L \mid Y_t = P_K] = 2\theta(1 - \theta)^3 + 2\theta^3(1 - \theta)$.







Figure 2: Two Half-Cousins. (Left Panel) A pedigree with four non-founders of which two are
half-cousins together with their common grandparent. As before, the two black dots for each person
represent their two alleles, and the alleles of each individual are ordered, so that the left allele, or
paternal allele, is inherited from the person's father, while the right, maternal allele is inherited
from the mother. The two cousins are labeled $A$ and $B$. It is easy to see that the only possible IBD
is between alleles $A_m$ and $B_m$, the maternal alleles of individuals $A$ and $B$, respectively. (Right
Panel) This makes the four male founders irrelevant to the question of IBD. The four meioses are
labeled in the order of their bits, left-to-right, and the inheritance states are represented in binary
as $x_1x_2x_3x_4$. Let $x_i = 0$ if that allele was inherited from the parent's paternal allele, and $x_i = 1$ if
from the maternal allele. For instance, $A$ and $B$ are IBD only for inheritance states 1001 and 1111.

Notice that this partition can be expressed as the orbits of a group of isometries, because
$G = \langle (1\ 4), (2\ 3), \phi_{0110} \rangle$ does not violate the IBD class.
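
The half-cousin example can be checked numerically. The Python sketch below computes the orbits of the three generators of $G$ on the 16 hypercube vertices, evaluates the block transition probability from $P_K$ to $P_L$, and compares the transition probabilities from 0001 and 0011 into $\{1001, 1111\}$. Helper names, the 1-based bit numbering from the left, and the value $\theta = 0.1$ are choices made for this illustration.

```python
from itertools import product

def bits(s):
    return tuple(int(c) for c in s)

def swap(i, j):
    """Bit transposition (i j), with positions numbered 1..n from the left."""
    def f(x):
        y = list(x)
        y[i - 1], y[j - 1] = y[j - 1], y[i - 1]
        return tuple(y)
    return f

def switch(a):
    """Switch phi_a(x) = a XOR x."""
    a = bits(a)
    return lambda x: tuple(p ^ q for p, q in zip(x, a))

def orbits(generators, points):
    seen, result = set(), []
    for start in points:
        if start in seen:
            continue
        orbit, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for g in generators:
                y = g(x)
                if y not in orbit:
                    orbit.add(y)
                    stack.append(y)
        seen |= orbit
        result.append(frozenset(orbit))
    return result

def block_prob(x, block, theta):
    """Sum over y in block of theta^{|x XOR y|} (1 - theta)^{n - |x XOR y|}."""
    n = len(x)
    total = 0.0
    for y in block:
        k = sum(p != q for p, q in zip(x, y))
        total += theta ** k * (1 - theta) ** (n - k)
    return total

cube = list(product((0, 1), repeat=4))
G_orbits = set(orbits([swap(1, 4), swap(2, 3), switch("0110")], cube))

P_K = frozenset(map(bits, ["0011", "0101", "1010", "1100"]))
P_L = frozenset(map(bits, ["0001", "0111", "1000", "1110"]))
E1 = frozenset(map(bits, ["1001", "1111"]))

theta = 0.1
p_KL = block_prob(bits("0011"), P_L, theta)
```

Every starting point in $P_K$ gives the same value `p_KL`, whereas 0001 and 0011 give different probabilities of entering $\{1001, 1111\}$, which is why the two-set emission partition cannot serve as the state space.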

2.3.2 Two-Generation Pedigrees

Lemma 4. For any two-generation pedigree, the partition defined by the emission partition, $C =
\{E_x \mid \forall x\}$, satisfies the Markov Property.

Proof. We can establish this by finding a group of isometries whose orbits are the emission partition.
This group has the generating set $A$ where $A = \{\phi_f : \forall f\} \cup \{\pi_m : \forall m\}$ and $\phi_f$ and $\pi_m$ are defined
as follows. For founder $f$, $\phi_f$ is a switch having bits set as follows. Let $i_1, \ldots, i_c$ be the meioses from
founder $f$ to each of the founder's $c$ children. Then $\phi_{f,i} = 1$ if $i = i_j$ for some $j$ and $\phi_{f,i} = 0$ otherwise.
Let $m = (f_1, f_2)$ be an untyped, monogamous, married founding pair. Then $\pi_m = c_1 \circ c_2 \circ \ldots \circ c_k$
is a permutation composed of $k$ disjoint cycles, one for each child. For child $i$ with meiosis bits
$i_0, i_1$, $c_i = (i_0\ i_1)$. The group of isometries is $G = \langle A \rangle$.

Now, we simply need to establish that the sets of the emission partition $C$ are the orbits of this group $G$.
There is no element $T \in G$ that maps $x \in E_{x_1}$ to $y \in E_{x_2}$, since every $\phi_f$ and $\pi_m$ map the bits of
$x$ in ways that maintain $CC(R_x)$. Now, we simply need to show that for any $x, y \in E_{x_1}$, there is
always some element $T \in G$ such that $y = T(x)$. Consider each connected component in $CC(R_x)$
where $x$ and $y$ differ. The alleles connected in this connected component must all share inheritance
through one of the founder bits of the common parents. If there is only one common parent, the
switch for that founder must map between $x$ and $y$ in the bits for that connected component. If
there are two common parents, then there must exist a composition of two founder switches and
the founder permutation that maps between $x$ and $y$ for the bits in that connected component. The
complete map $T$ is simply the composition of the isometries for each connected component. □
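
To illustrate the two kinds of generators used in this proof, the following minimal sketch represents inheritance vectors as tuples of bits, applies a switch (XOR with a fixed bit vector) and a bit permutation, and checks by brute force that both preserve Hamming distance on the hypercube. The function names and the 4-bit example are illustrative assumptions and do not come from the paper.

```python
from itertools import product

def hamming(x, y):
    """Number of positions where bit tuples x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def switch(mask):
    """Return the map x -> x XOR mask (a 'switch' on the bits set in mask)."""
    return lambda x: tuple(a ^ b for a, b in zip(x, mask))

def permute(perm):
    """Return the map that moves bit i of x to position perm[i] (0-based)."""
    def apply(x):
        y = [0] * len(x)
        for i, bit in enumerate(x):
            y[perm[i]] = bit
        return tuple(y)
    return apply

def is_isometry(f, n):
    """Brute-force check that f preserves Hamming distance on {0,1}^n."""
    cube = list(product((0, 1), repeat=n))
    return all(hamming(f(x), f(y)) == hamming(x, y) for x in cube for y in cube)

# Hypothetical example: a founder whose two children use meiosis bits 0 and 1,
# and a transposition exchanging bits 2 and 3.
phi = switch((1, 1, 0, 0))
pi = permute([0, 1, 3, 2])
print(is_isometry(phi, 4), is_isometry(pi, 4))  # True True
```

The brute-force check costs $O(2^{2n})$ distance evaluations, so it is only meant for small sanity checks.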

In the next section we will introduce the Maximal Ensemble Problem, and we will soon see that 
this lemma provides a fast method to obtain the optimal partition for two-generation pedigrees. 

2.4 The State-Space Reduction Problem 

Three state-space reduction problems have been posed; we restate them here. Given the
original pedigree state space $\mathcal{H}_n$, there are three ways to reduce the state space.

Maximum Ensemble Problem [8] Find the partition, $\{W_1, \dots, W_k\}$ of $\mathcal{H}_n$ that satisfies both
the Markov property and the emission property and that maximizes the sizes of the sets:

$$\max \frac{1}{k} \sum_{i=1}^{k} |W_i|.$$

Maximum Isometry Group Problem [2] Find the isometry group G of maximal size whose 
orbits satisfy the emission property. 

Maximum Symmetry Group Problem Find the symmetry group $G$ of maximal size whose
orbits $\Omega(G)$ satisfy both the Markov property and the emission property.

We have already proven that all symmetry groups that satisfy the Markov property have an 
isometry group with equivalent orbits. This means that the latter two problems are identical. Indeed
since these last two problems are equivalent, we will refer to them collectively as the Maximum 
Group Problem. The remaining question is the relationship between the maximum ensemble 
problem and the maximum isometry group problem. We will first introduce a Maximum Ensemble 
Algorithm and use it to prove that the solution to the Maximum Ensemble Problem is unique. 
Using the uniqueness result, we will be able to prove the equivalence of the Maximum Ensemble 
and Maximum Isometry Group Problems. 

3 Maximum Ensemble Algorithm 

We will introduce an algorithm that solves the Maximum Ensemble Problem and that helps us 
establish the uniqueness of the solution. We will then use uniqueness to prove that this algorithm 
also solves the Maximum Group Problem. 

Consider the emission partition containing $E_x$, for all $x$ of interest. Of course the sets in the
emission partition are disjoint. Consider the $(2^n)!$ permutations on the vertices of the hypercube.
Naively, these are all candidate permutations for our group, if we wish to find the maximal group.
However in this section, we focus on finding the partition that yields the maximum ensemble
solution. Given the state space, the partition can be found in linear time.

We do this by iteratively sub-partitioning the partition according to the coefficients and powers
appearing in Equation (3). See Algorithm 1 (Bipartition), which takes as input a subpartition of the
emission partition. This recursion is possible since the Markov property must produce a partition
that is a sub-partition of the emission partition (i.e. in order to respect the emission partition).
Indeed, as shown in Lemma 5, any pair of vectors $x_1, x_2$ that violate the Markov property must
appear in separate sets of the partition. This recursive approach will at worst produce a partition
with each element in its own set.

Algorithm 1 (Bipartition) only needs to compute the $2^n \times 2^n$ matrix of distances between IBD
vectors, as well as do some bookkeeping. So, the running time of a single call is $O(2^{2n})$. Since the iterative
sub-partitioning at minimum splits sets in two and does not introduce new inequalities, the number
of iterations of the partition algorithm is $O(\log(2^n)) = O(n)$. Each iteration of Algorithm 1
(Bipartition) requires $O(2^{2n})$ time, since we have to check the $2^n \times 2^n$ matrix of
distances between partition elements. So, the total running time is $O(n2^{2n})$.

Now, we need to establish the correctness and uniqueness of the partition. 

Lemma 5. Let $W_i, W_j$ be two sets of the partition such that $x_1, x_2 \in W_i$ and $x_1, x_2$ violate the
Markov property in Equation (3), i.e. such that

$$\sum_{y \in W_j} s^{|y \oplus x_1|} \neq \sum_{y \in W_j} s^{|y \oplus x_2|}.$$

Then even if $W_j$ is subdivided, $x_1, x_2$ continue to violate Equation (3).

Proof. This is proven by a simple property of polynomials. Since

$$\sum_{y \in W_j} s^{|y \oplus x_1|} \neq \sum_{y \in W_j} s^{|y \oplus x_2|},$$

there must be at least one power $k$ for which the polynomial coefficients disagree. Let $a_k$ and $b_k$ be
the coefficients from the left- and right-hand sides respectively. Let $A(k) = \{y \in W_j : |y \oplus x_1| = k\}$, so that
$a_k = |A(k)|$, and let $B(k) = \{y \in W_j : |y \oplus x_2| = k\}$, so that $b_k = |B(k)|$. Let $C, D$ be any bipartition
of $W_j$. Therefore $C$ and $D$ induce a partition of $A(k)$ and $B(k)$. Specifically $A(k)$ is partitioned
into sets $A(k) \cap C$ and $A(k) \cap D$, while $B(k)$ is partitioned into $B(k) \cap C$ and $B(k) \cap D$. Since
$|A(k)| \neq |B(k)|$, at least one of

$$|A(k) \cap C| \neq |B(k) \cap C| \quad \text{or} \quad |A(k) \cap D| \neq |B(k) \cap D|$$

holds. Therefore at least one of

$$\sum_{y \in C} s^{|y \oplus x_1|} \neq \sum_{y \in C} s^{|y \oplus x_2|} \quad \text{or} \quad \sum_{y \in D} s^{|y \oplus x_1|} \neq \sum_{y \in D} s^{|y \oplus x_2|}$$

holds.

□
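
The counting argument in Lemma 5 can be checked by brute force on a small case. The sketch below encodes inheritance vectors as integers, computes the coefficients $a_k$ as a histogram of Hamming distances, and confirms that for every bipartition $(C, D)$ of a set $W_j$ the two vectors still disagree on $C$ or on $D$. The helper names and the 3-bit example are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

def dist_hist(x, W):
    """Coefficients a_k: how many y in W lie at Hamming distance k from x."""
    return Counter(bin(x ^ y).count("1") for y in W)

def still_violates(x1, x2, W):
    """Check every bipartition (C, D) of W: x1 and x2 must disagree on C or on D."""
    W = sorted(W)
    for r in range(len(W) + 1):
        for C in combinations(W, r):
            D = [y for y in W if y not in C]
            same_on_C = dist_hist(x1, C) == dist_hist(x2, C)
            same_on_D = dist_hist(x1, D) == dist_hist(x2, D)
            if same_on_C and same_on_D:
                return False
    return True

# Hypothetical 3-bit example: x1 = 000 and x2 = 011 have different distance
# histograms to W = {000, 001, 111}, so they violate the Markov property on W.
W = [0b000, 0b001, 0b111]
print(dist_hist(0b000, W) != dist_hist(0b011, W))  # True
print(still_violates(0b000, 0b011, W))             # True
```

The check works because the histogram over $W_j$ is the sum of the histograms over $C$ and $D$, which is exactly the property the proof uses.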

Lemma 6. (Loop Invariant.) Once $C_{i0}$ is added to $P'$, it is never subdivided again in any iteration.
This is equivalent to stating the invariant that for any $i$,

$$\sum_{y \in W_j} s^{|y \oplus x_1|} = \sum_{y \in W_j} s^{|y \oplus x_2|} \quad \forall\, x_1, x_2 \in C_{i0},\ \forall\, W_j \in P'.$$
Algorithm 1 Bipartition(P) in $O(2^{2n})$ time
input:
    $P$: current subpartition of the emission partition
output:
    $P'$: violates fewer equations of the Markov property
main:
    $P' = \emptyset$
    foreach $W_i \in P$ do
        $C_{i0} = W_i$
        $C_{i1} = \emptyset$
        foreach $W_j \in P$ do
            $a_k = 0$ for all $0 \le k \le n$
            $s_{x'} = 0$ for all $x' \in C_{i0}$
            Let $x_1 \in C_{i0}$ be a fixed element of $C_{i0}$.
            foreach $x \in C_{i0}$ do
                $b_k = 0$ for all $0 \le k \le n$
                foreach $y \in W_j$ do
                    Let $k = |y \oplus x|$
                    if $x == x_1$ then
                        $a_k$++
                    end if
                    $b_k$++
                end for
                if $a_k \neq b_k$ for some $0 \le k \le n$ then
                    $s_x = 1$
                end if
            end for
            {Bipartition $W_i$}
            foreach $x \in C_{i0}$ do
                $C_{i0} \leftarrow C_{i0} \setminus \{x\}$
                $C_{i s_x} \leftarrow C_{i s_x} \cup \{x\}$
            end for
        end for
        $P' \leftarrow P' \cup \{C_{i0}, C_{i1}\}$
    end for
    RETURN $P'$
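
The refinement idea can also be sketched in runnable form. The code below is not Algorithm 1 verbatim: as a simplifying assumption it splits each set into all of its distance-signature classes at once (rather than into two parts relative to a fixed $x_1$) and repeats until nothing changes. States are encoded as integers, and the example emission partition is our reading of the cousin example in Section 2.3.1 (IBD states 1001 and 1111 against all other states).

```python
def distance_signature(x, partition, n):
    """For each set W_j, count the states of W_j at each Hamming distance from x."""
    sig = []
    for W in partition:
        counts = [0] * (n + 1)
        for y in W:
            counts[bin(x ^ y).count("1")] += 1
        sig.append(tuple(counts))
    return tuple(sig)

def refine_once(partition, n):
    """Split every set into groups of states that share the same signature."""
    new_partition = []
    for W in partition:
        groups = {}
        for x in W:
            groups.setdefault(distance_signature(x, partition, n), []).append(x)
        new_partition.extend(groups.values())
    return new_partition

def maximum_ensemble(emission_partition, n):
    """Refine the emission partition until no set is split any further."""
    partition = [sorted(W) for W in emission_partition]
    while True:
        refined = refine_once(partition, n)
        if len(refined) == len(partition):  # refinement only splits, so no change
            return refined
        partition = refined

# Cousin example with n = 4 meioses: IBD only for states 1001 and 1111.
ibd = [0b1001, 0b1111]
rest = [s for s in range(16) if s not in ibd]
result = maximum_ensemble([ibd, rest], 4)
print(len(result), sorted(len(W) for W in result))  # 6 [2, 2, 2, 2, 4, 4]
```

Each pass compares every state against every other state, which matches the $O(2^{2n})$ cost per iteration discussed above.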



Proof. Notice that the above invariant is a consequence of both the loop "foreach $W_j \in P$" and of
the Bipartition algorithm. For the base case $C_{i0} = W_i$, and the invariant holds trivially.

Now we need to inductively prove that the invariant holds. Assume that for some $i$, the invariant
holds. Now, consider the loop for a fixed $W_j \in P$. $W_j$ may be partitioned into some $C_{j0}$ and $C_{j1}$.
Our task is to prove that for the new partition of $W_j$ the invariant holds, i.e. that

$$\sum_{y \in C_{j0}} s^{|y \oplus x_1|} = \sum_{y \in C_{j0}} s^{|y \oplus x_2|} \quad \forall\, x_1, x_2 \in C_{i0}.$$

From the invariant, we have $\sum_{y \in W_j} s^{|y \oplus x_1|} = \sum_{y \in W_j} s^{|y \oplus x_2|}$ for all $x_1, x_2 \in C_{i0}$. Fix $k$ and define
the set

$$A(k, x_1) := \{y \in W_j : |y \oplus x_1| = k\} \quad \forall\, x_1 \in C_{i0},$$

then the coefficient of the $k$th power in the equation is $|A(k, x_1)|$. Furthermore, we have
$|A(k, x_1)| = |A(k, x_2)|$ for all $x_1, x_2 \in C_{i0}$.

Notice that $C_{j0}$ was created with the property that
$\sum_{x_1 \in C_{i0}} s^{|x_1 \oplus y_1|} = \sum_{x_1 \in C_{i0}} s^{|x_1 \oplus y_2|}$ for all
$y_1, y_2 \in C_{j0}$. Define the set

$$B(k, x_1) := \{y_1 \in C_{j0} : |x_1 \oplus y_1| = k\} \quad \forall\, x_1 \in C_{i0},$$

and its mirror set

$$D(k, y_1) := \{x_1 \in C_{i0} : |x_1 \oplus y_1| = k\} \quad \forall\, y_1 \in C_{j0}.$$

Notice that $A(k, x_1) \cap C_{j0} = B(k, x_1)$ for all $x_1 \in C_{i0}$.

Now we will use the property $|D(k, y_1)| = |D(k, y_2)|$ for all $y_1, y_2 \in C_{j0}$ to prove that $|B(k, x_1)| =
|B(k, x_2)|$ for all $x_1, x_2 \in C_{i0}$. Let $\phi : C_{j0} \to C_{j0}$ be a bijective map on $C_{j0}$.
Pick a bijective map $\pi : C_{i0} \to C_{i0}$ with $\pi(x_1) = x_2$ that maps elements of $D(k, y_1)$ to elements of $D(k, \phi(y_1))$. Now,
we will show that $y_1 \in B(k, x_1)$ if and only if $\phi(y_1) \in B(k, \pi(x_1))$. Now $y_1 \in B(k, x_1) = A(k, x_1) \cap
C_{j0}$, so this is equivalent to $x_1 \in D(k, y_1)$, which in turn is true if and only if $\pi(x_1) \in D(k, \phi(y_1))$,
or if and only if $\phi(y_1) \in A(k, \pi(x_1))$. Then since $\phi(y_1) \in C_{j0}$, we have that $\phi(y_1) \in B(k, \pi(x_1))$.

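This proves that $|B(k, x_1)| = |B(k, x_2)|$ for all $x_1, x_2 \in C_{i0}$. Therefore we have

$$\sum_{y \in C_{j0}} s^{|y \oplus x_1|} = \sum_{k=0}^{n} |B(k, x_1)|\, s^k \quad \forall\, x_1 \in C_{i0}.$$

Therefore, we have the invariant that

$$\sum_{y \in C_{j0}} s^{|y \oplus x_1|} = \sum_{y \in C_{j0}} s^{|y \oplus x_2|} \quad \forall\, x_1, x_2 \in C_{i0}.$$

□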

Theorem 7. (Uniqueness of Solution.) The Maximum Ensemble Algorithm finds the unique solu- 
tion to the Maximum Ensemble Problem. 

Proof. The partitioning algorithm produces a partition that respects the emission partition, since 
it begins with the partition given by the emission partition and sub-partitions it. The algorithm 
also produces partitions that respect the Markov property, since it iteratively sub-partitions the 
emission partition until the Markov property is satisfied. Notice that the algorithm is guaranteed 
to find such a partition since the trivial partition, i.e. the original state space, satisfies the Markov 
property. Since partition sets are only divided if they violate the Markov property, the algorithm 
necessarily finds an optimal partition. Only the proof of uniqueness remains. 

By Lemma 5, the solution is invariant to the order in which the bipartitions are made, since
any $x_1, x_2$ which violate the Markov property must be put into separate sets of the partition at
some point. Indeed, by Lemma 6, we know that once $C_{i0}$ is created, it is never partitioned again.
Since we begin with a unique partition, the emission partition, the sequence of sets $C_{i0}$ created by
different calls to Algorithm 1 will be the final sets in the partition, up to reordering. Therefore the
Maximum Ensemble Algorithm finds the unique partition which is the solution to the Maximum
Ensemble Problem. □
4 Equivalence



Now, using the uniqueness of a partition as the solution to the Maximum Ensemble Problem, we 
can prove equivalence of the Maximum Ensemble Problem and the Maximum Isometry Group 
Problem. 

Theorem 8. (Equivalence of Maximum Ensemble Problem and Maximum Isometry Group Problem)
A partition $\{W_1, W_2, \dots, W_k\}$ is a solution to the Maximum Ensemble Problem if and only if
there is an isometry group $G$ that is a solution to the Maximum Group Problem having orbits $\Omega(G)$
equivalent to the partition: for all $\omega$, we have $\omega \in \Omega(G)$ if and only if there exists a set $W_j$ in the
partition such that $W_j = \omega$.

Proof. First, due to Corollary 3, we know that only isometry groups satisfy the Markov property.
Any partition which is a solution for the Maximum Ensemble Problem is also, in particular, the
orbits of a group of isometries, $G$. Suppose, for contradiction, that $G$ is not the maximal isometry group.
Then there must be some isometry which can be added and, if it were added, it would
join two orbits into one, thereby joining two sets of the partition into one, which contradicts the
assumption that the partition was maximal. Furthermore, since $G$ satisfies the emission property,
its orbits must be a subpartition of the emission partition. There is no other group $G'$ with larger
size, since the solution to the Maximum Ensemble Problem is unique (Theorem 7). Therefore the solution to the
Maximum Ensemble Problem is a solution to the Maximum Group Problem.

For the converse we argue by contrapositive. Assume that partition $\{W_1, \dots, W_k\}$ is not a
solution to the Maximum Ensemble Problem but that it satisfies Equation (3). Then there must
exist some partition $\{V_1, \dots, V_l\}$ such that $\frac{1}{k}\sum_{i=1}^{k} |W_i| < \frac{1}{l}\sum_{j=1}^{l} |V_j|$. This inequality is strict by the
uniqueness Theorem 7. There must exist some $i$, $i'$, and $j$, such that $W_i \subset V_j$ and $W_{i'} \subset V_j$.

By Corollary 3, there are groups $G^W$ and $G^V$ with orbits $\{W_1, \dots, W_k\}$ and $\{V_1, \dots, V_l\}$, respectively.
Choose $x_1 \in W_i \cap V_j$ and $x_2 \in W_{i'} \cap V_j$. Then $\pi_{x_1,x_2}$ from Theorem 1 will be in $G^V$ and
not in $G^W$. Therefore, $G^W$ is not a solution to the Maximum Isometry Group Problem, proving the
claim. □

5 Bootstrapping with Known Isometries 

As noted by Geiger et al. [8], there are two types of isometries that can be detected easily. There 
are the founder isometries and the chain isometries where there is an outbred lineage consisting of 
multiple ungenotyped generations. 

The founder isometries apply only to ungenotyped founders and are switches on the bits for the
edges adjacent to the founder. Specifically, if $i_1, \dots, i_c$ are the meiosis bits between the ungenotyped
founder and each of the $c$ children of the founder, then the switch is given by the bit vector $x_i = 1$
if $i = i_j$ for some $j$ and $x_i = 0$ otherwise. Since the founder alleles are indistinguishable (due to
the missing genotype), we can fix one bit adjacent to the founder and enumerate the other bits
adjacent to that founder. These founder isometries can be found in $O(n)$ time.

The chain isometries apply to a lineage of $l$ individuals, from oldest to youngest $i_1, i_2, \dots, i_l$,
where each individual has exactly one parent from the lineage, one founder parent, one child, and
no siblings, except $i_1$ which may have any number of siblings. All individuals except the most
recent must be ungenotyped. The isometry is then the permutation on every bit, except the oldest,
i.e. $\pi = (i_1\ i_2\ i_3\ \cdots\ i_l)$. Please see Geiger et al. [8] and Browning and Browning [2] for examples.
These chain isometries can be found in O(n^) time. 
It would seem that there are other classes of isometries which can be found quickly, such as the
permutations shown in the example in Section 2.3.1. The exact algorithms for finding other classes
of isometries remain an open problem. Furthermore, it is unknown whether all the isometries in
the maximal group can be found efficiently.

5.1 Representatives 

Let $A$ be a generating set of isometries that generates the group $G = \langle A \rangle$, such as the founder and chain
isometries. In order to compute the bootstrap maximum ensemble states, we need to obtain the
orbits of $G$ acting on $\mathcal{H}_n$. We can obtain them in $O(k|A|o)$ time where $k$ is the number of orbits
and $o = \max_{x \in \mathcal{H}_n} |\omega(x)|$, provided that orbit membership can be checked in constant time.

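Let $M = \mathcal{H}_n$ initially. We take any vector $x$ out of $M$ and find its orbit $O$. Initially let $O = \{x\}$.
Now, for every $x \in O$ and every $a \in A$, compute $y = a(x)$. If $y \notin O$, add $y$ to $O$ and remove $y$ from
$M$. Once no new vector can be added, $O$ is complete; take another vector out of $M$ and repeat until $M$ is empty.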
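
The orbit procedure just described can be sketched as a breadth-first search over the generators. In the sketch below, states are integers, generators are plain Python functions, and the example generators are our reading of the Section 2.3.1 group (the transpositions (1 4) and (2 3) and the switch 0110, with meiosis 1 as the most significant bit). The helper names `swap_bits` and `switch` are hypothetical.

```python
from collections import deque

def swap_bits(i, j):
    """Transposition of bit positions i and j (0 = least significant bit)."""
    def apply(x):
        if ((x >> i) & 1) != ((x >> j) & 1):
            x ^= (1 << i) | (1 << j)
        return x
    return apply

def switch(mask):
    """Switch: flip the bits that are set in mask."""
    return lambda x: x ^ mask

def orbits(generators, n):
    """Partition {0,1}^n (as ints) into orbits of the group generated by `generators`."""
    remaining = set(range(2 ** n))          # the set M of the text
    result = []
    while remaining:
        x = remaining.pop()
        orbit = {x}
        queue = deque([x])
        while queue:
            z = queue.popleft()
            for a in generators:
                y = a(z)
                if y not in orbit:          # constant-time membership check
                    orbit.add(y)
                    remaining.discard(y)
                    queue.append(y)
        result.append(sorted(orbit))
    return result

gens = [swap_bits(3, 0), swap_bits(2, 1), switch(0b0110)]
orbs = orbits(gens, 4)
representatives = sorted(o[0] for o in orbs)  # one fixed representative per orbit
print(len(orbs), [0b1001, 0b1111] in orbs)    # 6 True
```

Closing each orbit under the generators alone is sufficient here because every generator is a permutation of a finite set, so its inverse is one of its own powers.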

Following this procedure, we have all of the orbits of $G$ acting on $\mathcal{H}_n$. For each orbit, we will
fix a representative to use in the bootstrap maximal ensemble algorithm.

5.2 Bootstrap Maximal Ensemble 

Now that we have $k$ representatives, one from each orbit of group $G = \langle A \rangle$, we can introduce a
bootstrap version of the Maximal Ensemble algorithm. In this case, we can compute Equation (3)
once per representative.

First, we need to partition our representatives according to the set of the emission partition that
they belong to. Consider the emission partition, $\{E_x \mid \forall x\}$, and partition the representatives into
these sets. Also partition $\mathcal{H}_n$ according to the emission partition. These two equivalent partitions
define our initial partitions.

Now, we can recursively sub-divide the representatives whenever Equation (3) is violated. Notice
that we can compute this equation with $x$ being the representative and $\omega_j$ being some set of the current
partition of $\mathcal{H}_n$. Each time we subdivide the partition of the representatives, we need to also
subdivide the partition of $\mathcal{H}_n$ in the equivalent fashion. Suppose that we have representative $x$
that we have put into a new set in the representative partition. We obtain the equivalent partition
of $\mathcal{H}_n$ by creating a new set containing $x$ and all the vectors $y \in \omega(x)$, the orbit of $x$ under the
action of $G$. The recursive subdivision continues until the Markov property is satisfied.

Since the recursive sub-partitioning at minimum splits sets in two, the number of iterations
required is $O(n)$. Checking the Markov properties for each iteration requires $O(k2^n)$ time where $k$
is the number of representatives, since we have to check the $k \times 2^n$ matrix of distances, or sums of
distances, between partition elements. So, the total running time is $O(nk2^n)$.
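
A minimal sketch of the bootstrap idea follows. It assumes that the orbits of the known isometry group are already available as lists of integer states and that every emission class is a union of those orbits; the function name, the `emission_label` callback, and the example (orbits of the single switch 0110 in the 4-meiosis cousin example) are illustrative assumptions. The distance signature is computed once per representative and whole orbits are moved together.

```python
def bootstrap_ensemble(orbit_list, emission_label):
    """Refine the emission partition using one representative per known orbit."""
    reps = [orb[0] for orb in orbit_list]
    block = [emission_label(r) for r in reps]      # current set id of each orbit
    while True:
        # Equivalent partition of the full state space: state -> set id.
        state_block = {y: block[i] for i, orb in enumerate(orbit_list) for y in orb}
        new_ids, new_block = {}, []
        for i, r in enumerate(reps):
            counts = {}
            for y, b in state_block.items():       # k x 2^n distance evaluations
                key = (b, bin(r ^ y).count("1"))
                counts[key] = counts.get(key, 0) + 1
            sig = (block[i], frozenset(counts.items()))
            new_block.append(new_ids.setdefault(sig, len(new_ids)))
        if len(set(new_block)) == len(set(block)):  # nothing was split
            break
        block = new_block
    sets = {}
    for i, orb in enumerate(orbit_list):
        sets.setdefault(block[i], []).extend(orb)
    return [sorted(W) for W in sets.values()]

# Orbits of the switch 0110 alone: pairs {x, x XOR 0110}.
orbit_list = [[x, x ^ 0b0110] for x in range(16) if x < x ^ 0b0110]
ensemble = bootstrap_ensemble(orbit_list, lambda x: int(x in (0b1001, 0b1111)))
print(len(orbit_list), len(ensemble))  # 8 6
```

In this example the eight bootstrap orbits are merged down to six ensemble states, with only eight signatures computed per pass instead of sixteen.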

6 Running Times 

Notice that the naive calculation of Equation (1) requires $O(k2^n)$ time where $k \le 2^n$ is the number
of sets in the partition and $n$ is the number of meioses in the pedigree. The calculation is as
follows: for each set $W_i$ in the partition, choose a representative $x \in W_i$. For each of the sets
$W_j$ in the partition, compute the transition probability $\Pr[X_{t+1} \in W_j \mid X_t = x]$. This last step seems to
require enumeration of the inheritance paths.
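
The naive calculation can be sketched as follows. Equation (1) is not reproduced in this section, so the sketch assumes the usual pedigree HMM transition model in which each meiosis bit flips independently with recombination probability `theta` between adjacent sites; under that assumption the probability of moving from $x$ to $y$ is $\theta^{|x \oplus y|}(1-\theta)^{n-|x \oplus y|}$. The partition used in the example is the six-set partition of the 4-meiosis cousin example, written out by hand.

```python
def lumped_transitions(partition, n, theta):
    """k x k matrix whose row i holds Pr[X_{t+1} in W_j | X_t = x], x a representative of W_i."""
    matrix = []
    for W_i in partition:
        x = W_i[0]  # any representative works if the Markov property holds
        row = []
        for W_j in partition:
            p = 0.0
            for y in W_j:  # enumerate the inheritance vectors of W_j
                d = bin(x ^ y).count("1")
                p += theta ** d * (1 - theta) ** (n - d)
            row.append(p)
        matrix.append(row)
    return matrix

partition = [[9, 15], [11, 13], [1, 7, 8, 14], [3, 5, 10, 12], [0, 6], [2, 4]]
T = lumped_transitions(partition, 4, 0.1)
print(all(abs(sum(row) - 1.0) < 1e-12 for row in T))  # True
```

Each row touches all $2^n$ states once, which gives the $O(k2^n)$ cost stated above.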
The running time of the state-space reduction is the running time of the ensemble algorithm
and the running time of the transition calculation. It is interesting to note that calculating the
transition probabilities in Equation (1) is faster than the HMM forward-backward algorithm having
running time $O(m2^{2n})$. This means there is potential to improve the state-space reduction running
time, if there is a more efficient maximal ensemble algorithm.

Regardless of whether the overall running time of the state-space reduction is determined
by calculating the transition function or the ensemble states, all the algorithms here produce
savings when the forward-backward algorithm is run. This is because a $k$-set partition of the states
results in the forward-backward algorithm having $O(mk)$ running time where $m$ is the number of
sites. Furthermore, since the original state space has an $O(m2^{2n})$ forward-backward algorithm and
the ensemble algorithm is $O(n2^{2n})$, the ensemble algorithm is more efficient when $n < m$, which
is typically the case. The bootstrap algorithm is even more efficient, having a running time of
$O(nk2^n)$.
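
As a rough worked comparison, take the hypothetical values $n = 14$ meioses and $m = 1000$ sites (illustrative numbers only, not taken from the experiments) and ignore the constants hidden by the $O$-notation:

$$2^{2n} = 2^{28} = 268{,}435{,}456, \qquad n\,2^{2n} = 14 \cdot 2^{28} \approx 3.8 \times 10^{9}, \qquad m\,2^{2n} = 1000 \cdot 2^{28} \approx 2.7 \times 10^{11}.$$

Under these assumptions the ensemble algorithm performs about $m/n \approx 71$ times fewer basic operations than the forward-backward algorithm on the original state space.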

7 Simulation Results 

We simulated pedigrees under a Wright-Fisher model with monogamy where each pair of monogamous
individuals has a Poisson-distributed number of offspring. There are $n$ individuals per generation
and $\lambda$ is the mean number of offspring per monogamous pair. The individuals of interest, $I$,
are the extant individuals, i.e. those in the most recent generation or, equivalently, the nodes with
out-degree zero. These pedigrees have no inter-generational mating due to how the Wright-Fisher
model is defined. To get a half-sibling pedigree, each edge of the pedigree had a 50% chance of having
a new parent drawn at random. Since monogamy was not preserved during this random process,
the resulting pedigree had half-siblings.

Running the simulation process and the maximal ensemble algorithm 100 times produced Figure 3.
The maximal ensemble algorithm produces exponential reductions in the size of the state-space.
Whether the relationships have half-siblings seems not to influence the practical applicability
of the maximal ensemble algorithm (data not shown).

In practice, the maximal ensemble algorithm seems limited to pedigrees of roughly 14 meioses 
while the bootstrap maximal ensemble algorithm seems limited to about 18 meioses. Of course, 
both methods yield the same reduced state space. Given the practical success of the bootstrap 
maximal ensemble algorithm, we recommend that the bootstrap maximal ensemble algorithm be 
employed for state-space reduction. 

8 Discussion 

Even though past efforts at state-space reduction have focused on finding groups of isometries on
the edges of the pedigree graph, it is clear that this is an equivalent problem to finding the optimal
partition of the state space that respects the Markov property. Although the paper mostly discusses
the pedigree state-space, the maximum ensemble algorithm is general to any HMM.

Even if some isometries can be obtained efficiently, for example the founder and chain isometries,
computation of the transition probabilities according to Equation (1) seems to require enumeration of
the inheritance vectors. The naive algorithm requires $O(k2^n)$ time where $k$ is the number of orbits and $n$
is the number of meioses in the pedigree. Due to this fact, and the fact that the forward-backward
algorithm for pedigree HMMs has running time $O(m2^{2n})$, it is an advantage to use exponential
algorithms to find the maximal state-space reduction. Indeed, the maximal ensemble algorithm we
introduce here has running time $O(n2^{2n})$ which yields more efficient HMM algorithms when $n < m$
where $n$ is the number of meioses in the pedigree and $m$ is the number of sites.

[Figure 3 plot: "Maximal Ensembles for 3-Gen. Pedigrees"; axis label: Input State Space]

Figure 3: Maximal Ensemble Algorithm Results. The y-axis is the original size of the state
space, and the x-axis gives the number of ensemble states produced by the maximal ensemble
algorithm. All of the simulated pedigrees had three generations and Poisson mean $\lambda = 2$. One
hundred simulation replicates had $n = 4$.

In addition to introducing the maximal ensemble algorithm, we introduced a bootstrap maximal
ensemble algorithm which runs in $O(nk2^n)$ time where $k$ is the number of orbits of the bootstrap isometry
group. This allows our algorithm to take advantage of known isometries such as the founder and
chain isometries.

It would appear that there might be an $O(2^{2n})$ algorithm for the maximum ensemble problem.
This can be seen by looking at the for loop of Algorithm 1 (Bipartition) that says "foreach
$x \in A_0$ do". This could easily be changed to "foreach $A_s$ and foreach $x \in A_s$ do". However, this
algorithm appears to require sorting the sets in the emission partition in increasing order by size.
We do not consider the details of this improved algorithm due to space considerations.

In practice, the maximal ensemble algorithm obtains exponential reductions in the state-space 
required for an HMM likelihood calculation. The algorithm operates on up to about 14 meioses. 

There are several open problems of interest. First, the computational complexity of the max- 
imum ensemble problem is open. Second, an open problem is the computational complexity of 
finding the transition rates after having determined the partition of the state space. Although 
naive algorithms are exponential, it is unclear whether there are approximation algorithms or 
polynomial-time algorithms for special cases. 

Another very interesting direction is approximation algorithms where instead of guaranteeing
equality in Equation (3), we could allow for bounded inequalities. Let $Y_t$ be the approximate Markov
chain and $X_t$ be the original Markov chain. The idea is that a bound on the inequality for the
transition probabilities of $Y_t$ would allow for a larger reduction in the state-space. In addition, we
would hope that the bound on the inequality would guarantee that the deviation of $Y_t$'s stationary
distribution is bounded relative to the stationary distribution of $X_t$.



Acknowledgments. Many thanks go to Yun Song for suggesting the problem and to Eran 